Abstract. The stochastic block model is a powerful tool for inferring
community structure from network topology. However, it predicts a Pois- 
son degree distribution within each community, while most real-world 
networks have a heavy-tailed degree distribution. The degree-corrected 
block model can accommodate arbitrary degree distributions within com- 
munities. But since it takes the vertex degrees as parameters rather than 
generating them, it cannot use them to help it classify the vertices, and 
its natural generalization to directed graphs cannot even use the ori- 
entations of the edges. In this paper, we present variants of the block 
model with the best of both worlds: they can use vertex degrees and 
edge orientations in the classification process, while tolerating heavy- 
tailed degree distributions within communities. We show that for some 
networks, including synthetic networks and networks of word adjacencies 
in English text, these new block models achieve a higher accuracy than 
either standard or degree-corrected block models. 

Keywords: complex networks, community detection, generative model,

1 Introduction

In many real-world networks, vertices can be divided into communities or mod- 
ules based on their connections. Social networks can be forged by interactions 
in daily activities like karate training [24]. The blogosphere contains groups of 
linked blogs with similar political views pQ. Words can be tagged as different 
parts of speech based on their adjacencies in large texts [17]. Communities range
from assortative clumps, where vertices preferentially attach to others of the 
same type, to functional communities of vertices that connect to the rest of 
the network in similar ways, such as groups of predators in a food web that 
feed on similar prey [H [Hj . Understanding this variety of community structures, 
and their relationships to functional roles of vertices and edges, is crucial to 
understanding network data.

The stochastic block model (SBM) [lj3 E2 E2 [2] is a popular and highly
flexible generative model for community detection. It partitions the vertices into
communities or blocks, where vertices belonging to the same block are stochastically
equivalent [23] in the sense that the probabilities of a connection with all
other vertices are the same for all vertices in the same block. This definition of 
community is quite flexible, letting block models capture many types of commu- 
nity structure, including assortative, disassortative, and satellite communities 
and mixtures of them [HI [HI QU QU 18] . 

The SBM assumes that each edge is generated independently conditioned 
on the block memberships. Each entry $A_{uv}$ of the adjacency matrix is then
Bernoulli-distributed, where the probability that $A_{uv} = 1$ depends solely on the
block memberships $g_u, g_v$ of its endpoints. Since every pair of vertices in a given
pair of blocks are connected with the same probability, for large n the degree 
distribution within each block is Poisson. As a consequence, vertices with very 
different degrees are unlikely to be in the same block. This leads to problems 
when modeling real networks, which often have heavy-tailed degree distributions 
within each community. For instance, both liberal and conservative political 
blogs range from high-degree "leaders" to low-degree "followers" pQ. 
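
The narrow degree distribution of the SBM can be seen directly by sampling from it. The following is a minimal sketch, assuming NumPy; the function name `sample_sbm`, the block sizes, and the probabilities in `p` are illustrative choices, not values from the experiments in this paper.

```python
import numpy as np

def sample_sbm(g, p, rng):
    """Sample a simple undirected graph from the Bernoulli SBM.

    g   : length-N integer array of block labels
    p   : K x K symmetric matrix of connection probabilities
    rng : numpy random Generator
    """
    g = np.asarray(g)
    n = len(g)
    probs = np.asarray(p)[np.ix_(g, g)]                 # probs[u, v] = p[g_u, g_v]
    upper = np.triu(rng.random((n, n)) < probs, k=1)    # one coin flip per pair u < v
    return (upper | upper.T).astype(int)                # symmetrize, no self-loops

rng = np.random.default_rng(0)
g = np.repeat([0, 1], 500)                              # two blocks of 500 vertices
p = np.array([[0.04, 0.01], [0.01, 0.04]])              # illustrative values only
A = sample_sbm(g, p, rng)
deg = A.sum(axis=1)
expected = 0.04 * 499 + 0.01 * 500                      # expected degree of every vertex
print(deg.mean(), deg.var(), expected)
```

In such a sample the variance of the degrees stays close to their mean, as expected for a Poisson-like distribution, so a vertex whose degree is many times the block average is very unlikely under this model.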

Recently, Karrer and Newman [13] developed the degree-corrected (DC) block
model for undirected networks. They add a parameter for each vertex, which 
controls its expected degree. By setting these parameters equal to the observed 
degrees, the DC can accommodate arbitrary degree distributions within commu- 
nities. This removes the model's tendency to separate high-degree and low-degree 
vertices into different communities. Similar models were considered by Mørup
and Hansen [16] and Reichardt, Alamino, and Saad [21].

On the other hand, the degree-corrected model cannot use the vertex degrees 
to help it classify the vertices, precisely because it takes the degrees as param- 
eters rather than as data that need to be explained. For this reason, DC may 
actually fail to recognize communities that differ significantly in their degree 
distributions. Thus we have two extremes: the SBM separates vertices by degree 
even when it shouldn't, and DC fails to do so even when it should. 

For directed graphs, the natural generalization of DC, the directed degree- 
corrected (DDC) block model, has two parameters for each vertex: the expected 
in-degree and out-degree. But this model cannot even take advantage of edge 
orientations. For instance, in English adjectives usually precede nouns but rarely 
vice versa. Thus the ratio of each vertex's in- and out-degree is strongly indicative 
of its block membership, and leveraging this part of the data is very helpful in 
the classification process. 

In this paper, we propose two new types of block model, which combine the 
strengths of the degree-corrected and uncorrected block models. The oriented 
degree-corrected (ODC) block model is able to utilize the edge orientations for
community detection by only correcting the total degrees instead of the in- and 
out-degrees separately. We show that for networks with strongly asymmetric 
behavior between communities, including synthetic networks and networks of
word adjacencies in English text, ODC achieves a higher accuracy than either
the original stochastic block model or the degree-corrected block model. 

We also propose the degree-generated (DG) block model, which treats the 
expected degree of each vertex as generated from prior distributions in each 
community, such as power laws whose exponents and cutoffs vary from one com- 
munity to another. By including the probability of these degrees in the likeli- 
hood of a given block assignment, the model captures the interaction between 
the degree distribution and the community structure. DG automatically strikes 
a balance between allowing vertices of different degrees to coexist in the same 
community on the one hand, and using vertex degrees to separate vertices into 
communities on the other. 

Our experiments show that DG works especially well in networks where com- 
munities have highly inhomogeneous degree distributions, but where the degree 
distributions differ enough between communities so that we can use vertex de- 
grees to help us classify the vertices. Both the standard and degree-corrected 
block models classify nodes solely on the basis of the relative density of connec- 
tions between communities, with different notions of "density." DG block models 
let us leverage degree information as well. In some cases, DG has a further advan- 
tage in faster convergence as it reshapes the landscape of the parameter space, 
providing the inference algorithm a shortcut to the correct community structure. 

These new variants of the block model give us the best of both worlds. They 
can tolerate heavy-tailed degree distributions within communities, but can also 
use degrees and edge orientations to help classify the vertices. In addition to their 
performance on these networks, our models illustrate a valuable point about 
generative models and statistical inference: when inferring the structure of a 
network, you can only use the information that you try to generate. 

2 The models 

In this section, we review the degree-corrected block model of [13], and present
our variations on it, namely oriented and degree-generated block models. 

2.1 Background: degree-corrected block models 

Throughout, we use N and M to denote the number of vertices and edges, and 
K to denote the number of blocks. The problem of determining the number of 
blocks is a subtle model selection problem, which we do not address here. 

In the original stochastic block model, the entries $A_{uv}$ of the adjacency matrix
are independent and Bernoulli-distributed, with $P(A_{uv} = 1) = p_{g_u, g_v}$. Here
$g_u$ is the block to which $u$ belongs, and $p$ is a $K \times K$ matrix. Karrer and
Newman [13] consider random multigraphs where the $A_{uv}$ are independent and
Poisson-distributed,

$$A_{uv} \sim \mathrm{Poi}(\theta_u \theta_v \omega_{g_u, g_v}) .$$

Here $\omega$ replaces $p$, and $\theta_u$ is an overall propensity for $u$ to connect to other
vertices. Note that since the $A_{uv}$ are independent, the degrees $d_u$ will vary somewhat
around their expectations; however, the resulting model is much simpler
to analyze than one that controls the degree of each vertex exactly.
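
As an illustration of this generative step, here is a minimal sketch, assuming NumPy, that draws a multigraph with independent Poisson entries of mean $\theta_u \theta_v \omega_{g_u, g_v}$. The name `sample_dc`, the target degrees in `theta`, and the between-block count `c` are hypothetical; `omega` is scaled so that each expected degree is approximately $\theta_u$.

```python
import numpy as np

def sample_dc(theta, g, omega, rng):
    """Sample an undirected multigraph with A_uv ~ Poi(theta_u theta_v omega[g_u, g_v]) for u < v."""
    theta, g = np.asarray(theta, dtype=float), np.asarray(g)
    mean = np.outer(theta, theta) * np.asarray(omega)[np.ix_(g, g)]
    upper = np.triu(rng.poisson(mean), k=1)      # keep pairs u < v, ignoring self-loops
    return upper + upper.T

rng = np.random.default_rng(1)
g = np.repeat([0, 1], 200)
theta = rng.integers(1, 41, size=400).astype(float)    # hypothetical target degrees
kappa = np.array([theta[g == 0].sum(), theta[g == 1].sum()])
c = 0.2 * kappa.min()                                   # hypothetical between-block edge count
m = np.array([[kappa[0] - c, c], [c, kappa[1] - c]])    # row sums equal kappa
omega = m / np.outer(kappa, kappa)
A = sample_dc(theta, g, omega, rng)
deg = A.sum(axis=1)
print(deg.sum(), theta.sum())
```

The realized degrees fluctuate around `theta` rather than matching it exactly, which is the trade-off described above.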

Ignoring self-loops, the likelihood with which this degree-corrected (DC) 
block model generates an undirected multigraph G is then 

$$P(G \mid \theta, \omega, g) = \prod_{u<v} \frac{(\theta_u \theta_v \omega_{g_u, g_v})^{A_{uv}}}{A_{uv}!} \exp(-\theta_u \theta_v \omega_{g_u, g_v}) . \qquad (1)$$

To remove the obvious symmetry where we multiply the $\theta$'s by a constant $C$ and
divide $\omega$ by $C^2$, we can impose a normalization constraint $\sum_{u: g_u = r} \theta_u = \kappa_r$ for
each block $r$, where $\kappa_r = \sum_{u: g_u = r} d_u$ is the total degree of the vertices in block
$r$. Under these constraints, the maximum likelihood estimates (MLEs) for the
parameters are $\hat{\theta}_u = d_u$. For each pair of blocks $r, s$, the MLE for $\omega_{rs}$ is then

$$\hat{\omega}_{rs} = \frac{m_{rs}}{\kappa_r \kappa_s} ,$$

where $m_{rs}$ is the number of edges connecting block $r$ to block $s$ (and edges within
blocks are counted twice). Substituting these MLEs for $\theta$ and $\omega$ then gives the
log-likelihood

$$\log P(G \mid g) = \frac{1}{2} \sum_{r,s=1}^{K} m_{rs} \log \frac{m_{rs}}{\kappa_r \kappa_s} . \qquad (2)$$
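
The profile log-likelihood (2) depends on the graph only through the block edge counts. A minimal sketch of its evaluation follows, assuming NumPy and a symmetric adjacency matrix with zero diagonal; the helper names and the toy graph (two triangles joined by one edge) are illustrative and not taken from the paper.

```python
import numpy as np

def block_edge_counts(A, g, K):
    """m[r, s] = sum of A[u, v] over u in block r and v in block s."""
    Z = np.eye(K)[np.asarray(g)]          # N x K one-hot membership matrix
    return Z.T @ A @ Z

def dc_loglik(A, g, K):
    """(1/2) sum_rs m_rs log(m_rs / (kappa_r kappa_s)), with 0 log 0 = 0 and constants dropped."""
    m = block_edge_counts(A, g, K)        # within-block edges are counted twice
    kappa = m.sum(axis=1)                 # total degree of each block
    denom = np.outer(kappa, kappa)
    mask = m > 0
    return 0.5 * float(np.sum(m[mask] * np.log(m[mask] / denom[mask])))

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] += 1
    A[v, u] += 1

ll_true = dc_loglik(A, [0, 0, 0, 1, 1, 1], 2)
ll_mixed = dc_loglik(A, [0, 1, 0, 1, 0, 1], 2)
print(ll_true, ll_mixed)
```

On this toy graph the split into the two triangles scores higher than the interleaved split, as one would hope for an assortative structure.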



2.2 Directed and oriented degree-corrected models 

The natural extension of the degree-corrected model to directed networks, which 
we call the directed degree-corrected block model (DDC), has two parameters 
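$\theta_u^{\mathrm{out}}, \theta_u^{\mathrm{in}}$ for each vertex. The number of directed edges from $u$ to $v$ is again
Poisson-distributed,

$$A_{uv} \sim \mathrm{Poi}(\theta_u^{\mathrm{out}} \theta_v^{\mathrm{in}} \omega_{g_u, g_v}) .$$

We impose the constraints $\sum_{u: g_u = r} \theta_u^{\mathrm{out}} = \kappa_r^{\mathrm{out}}$ and $\sum_{u: g_u = r} \theta_u^{\mathrm{in}} = \kappa_r^{\mathrm{in}}$ for each
block $r$, where $\kappa_r^{\mathrm{out}} = \sum_{u: g_u = r} d_u^{\mathrm{out}}$ and $\kappa_r^{\mathrm{in}} = \sum_{u: g_u = r} d_u^{\mathrm{in}}$ denote the total out-
and in-degree of block $r$. As before, let $m_{rs}$ denote the number of directed edges
from block $r$ to block $s$. Then the likelihood is

$$P(G \mid \theta, \omega, g) = \prod_{uv} \frac{(\theta_u^{\mathrm{out}} \theta_v^{\mathrm{in}} \omega_{g_u, g_v})^{A_{uv}}}{A_{uv}!} \exp(-\theta_u^{\mathrm{out}} \theta_v^{\mathrm{in}} \omega_{g_u, g_v})
= \frac{\prod_u (\theta_u^{\mathrm{out}})^{d_u^{\mathrm{out}}} (\theta_u^{\mathrm{in}})^{d_u^{\mathrm{in}}} \prod_{rs} \omega_{rs}^{m_{rs}} \exp(-\kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}} \omega_{rs})}{\prod_{uv} A_{uv}!} . \qquad (3)$$

Ignoring constants, we get the log-likelihood as follows:

$$\log P(G \mid \theta, \omega, g) = \sum_u \left( d_u^{\mathrm{out}} \log \theta_u^{\mathrm{out}} + d_u^{\mathrm{in}} \log \theta_u^{\mathrm{in}} \right) + \sum_{rs} \left( m_{rs} \log \omega_{rs} - \kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}} \omega_{rs} \right) . \qquad (4)$$

The MLEs for the parameters (see Appendix A) are

$$\hat{\theta}_u^{\mathrm{out}} = d_u^{\mathrm{out}} , \qquad \hat{\theta}_u^{\mathrm{in}} = d_u^{\mathrm{in}} , \qquad \hat{\omega}_{rs} = \frac{m_{rs}}{\kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}}} . \qquad (5)$$

Substituting these MLEs gives

$$\log P(G \mid g) = \sum_{r,s=1}^{K} m_{rs} \log \frac{m_{rs}}{\kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}}} . \qquad (6)$$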
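
To make the DDC profile log-likelihood concrete, the sketch below evaluates $\sum_{rs} m_{rs} \log [m_{rs} / (\kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}})]$ on a toy directed graph. It assumes NumPy and the convention that `A[u, v]` counts edges from `u` to `v`; the function name and the five-vertex example are illustrative assumptions.

```python
import numpy as np

def ddc_loglik(A, g, K):
    """sum_rs m_rs log(m_rs / (kappa_r^out kappa_s^in)), with 0 log 0 = 0 and constants dropped."""
    Z = np.eye(K)[np.asarray(g)]
    m = Z.T @ A @ Z                           # m[r, s]: directed edges from block r to block s
    k_out, k_in = m.sum(axis=1), m.sum(axis=0)
    denom = np.outer(k_out, k_in)
    mask = m > 0
    return float(np.sum(m[mask] * np.log(m[mask] / denom[mask])))

# Toy graph: vertices 0 and 1 each point to vertices 2, 3 and 4.
A = np.zeros((5, 5))
A[0:2, 2:5] = 1

ll_true = ddc_loglik(A, [0, 0, 1, 1, 1], 2)   # sources vs. targets
ll_one = ddc_loglik(A, [0, 0, 0, 0, 0], 2)    # everything in one block
print(ll_true, ll_one)
```

Here the split into sources and targets receives exactly the same score as placing every vertex in a single block, a small instance of the DDC's inability to exploit edge orientations.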



In the DDC, the in- and out-degrees of each vertex are completely specified
by the $\theta$ parameters, at least in expectation. Thus the DDC lets vertices with
arbitrary in- and out-degrees fit comfortably together in the same block. On
the other hand, since the degrees are given as parameters, rather than as data 
that the model must generate and explain, the DDC cannot use them to infer 
community structure. Indeed, it cannot even take advantage of the orientations 
of the edges, and as we will see below it performs quite poorly on networks with 
strongly asymmetric community structure. 

To deal with this, we present a partially degree-corrected block model capa- 
ble of taking advantage of edge orientations, which we call the oriented degree- 
corrected (ODC) block model. Following the maxim that we can only use the 
information that we try to generate, we correct only for the total degrees of the 
vertices, and generate the edges' orientations. 

Let $\bar{G}$ denote the undirected version of a directed graph $G$, i.e., the multigraph
resulting from erasing the arrows for each edge. Its adjacency matrix is
$\bar{A}_{uv} = A_{uv} + A_{vu}$, so (for instance) $\bar{G}$ has two edges between $u$ and $v$ if $G$
had one pointing in each direction. The ODC can be thought of as generating
$\bar{G}$ according to the undirected degree-corrected model, and then choosing the
orientation of each edge according to another matrix $\rho_{rs}$, where an edge $(u,v)$
is oriented from $u$ to $v$ with probability $\rho_{g_u, g_v}$. Thus the total log-likelihood is

$$\log P(G \mid \theta, \omega, \rho, g) = \log P(\bar{G} \mid \theta, \omega, g) + \log P(G \mid \bar{G}, \rho, g) . \qquad (7)$$

Writing $\bar{m}_{rs} = m_{rs} + m_{sr}$ and $\bar{\kappa}_r = \kappa_r^{\mathrm{in}} + \kappa_r^{\mathrm{out}}$, we can set $\theta_u$ and $\omega_{rs}$ for the
undirected model to their MLEs as in Section 2.1, giving

$$\log P(\bar{G} \mid g) = \frac{1}{2} \sum_{r,s=1}^{K} \bar{m}_{rs} \log \frac{\bar{m}_{rs}}{\bar{\kappa}_r \bar{\kappa}_s} . \qquad (8)$$



The orientation term is

$$\log P(G \mid \bar{G}, \rho, g) = \sum_{rs} m_{rs} \log \rho_{rs} = \frac{1}{2} \sum_{rs} \left( m_{rs} \log \rho_{rs} + m_{sr} \log \rho_{sr} \right) . \qquad (9)$$

For each $r, s$ we have $\rho_{rs} + \rho_{sr} = 1$, and the MLEs for $\rho$ are

$$\hat{\rho}_{rs} = m_{rs} / \bar{m}_{rs} . \qquad (10)$$

As (9) is maximized when the $\rho_{rs}$ are near 0 or 1, the edge orientation term
prefers highly directed inter-block connections. Since $\rho_{rr} = 1/2$ for any $r$, it also
prefers disassortative mixing, with as few connections as possible within blocks.
Substituting the MLEs for $\rho$ and combining (8) with (9), the total log-likelihood
is

$$\log P(G \mid g) = \sum_{r,s=1}^{K} m_{rs} \log \frac{m_{rs}}{\bar{\kappa}_r \bar{\kappa}_s} . \qquad (11)$$
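
The decomposition of the ODC log-likelihood into an undirected term and an orientation term can be checked numerically. The sketch below assumes NumPy, the convention that `A[u, v]` counts edges from `u` to `v`, and the readings of (8), (9) and (11) given in the comments; the function names and the toy graph are illustrative.

```python
import numpy as np

def _sum_m_log_ratio(m, denom):
    """Sum of m * log(m / denom) over entries with m > 0 (0 log 0 = 0)."""
    mask = m > 0
    return float(np.sum(m[mask] * np.log(m[mask] / denom[mask])))

def odc_terms(A, g, K):
    """Return (undirected term, orientation term, total) of the ODC profile log-likelihood."""
    Z = np.eye(K)[np.asarray(g)]
    m = Z.T @ A @ Z                                # directed block edge counts
    m_bar = m + m.T                                # undirected block edge counts
    k_bar = m.sum(axis=0) + m.sum(axis=1)          # in-degree plus out-degree per block
    kk = np.outer(k_bar, k_bar)
    undirected = 0.5 * _sum_m_log_ratio(m_bar, kk) # Eq. (8)
    orientation = _sum_m_log_ratio(m, m_bar)       # Eq. (9) at rho_rs = m_rs / m_bar_rs
    total = _sum_m_log_ratio(m, kk)                # Eq. (11)
    return undirected, orientation, total

# Toy graph: vertices 0 and 1 each point to vertices 2, 3 and 4.
A = np.zeros((5, 5))
A[0:2, 2:5] = 1

u_true, o_true, t_true = odc_terms(A, [0, 0, 1, 1, 1], 2)
u_one, o_one, t_one = odc_terms(A, [0, 0, 0, 0, 0], 2)
print(t_true, t_one)
```

On this graph the two terms add up to the total for both assignments, and the split into sources and targets scores higher than the single-block assignment because its orientation term is zero rather than negative.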

We can also view the ODC as a special case of the DDC, where we add the
constraint $\theta_u^{\mathrm{in}} = \theta_u^{\mathrm{out}}$ for all vertices $u$ (see Appendix B). Moreover, if we set $\theta_u = 1$
for all $u$, we obtain the original block model, or rather its Poisson multigraph
version where each $A_{uv}$ is Poisson-distributed with mean $\omega_{g_u, g_v}$. Thus

SBM < ODC < DDC , 

where A < B means that model A is a special case of model B, or that B is 
an elaboration of A. We will see below that since it is forced to explain edge 
orientations, the ODC performs better on some networks than either the simple 
SBM or the DDC. 



2.3 Degree-generated block models 

Another way to utilize vertex degrees for community detection is to require the 
model to generate them, but according to some distribution derived from domain 
knowledge or an overall measurement of the network's degree distribution. For 
instance, many real-world networks have a power-law degree distribution, but 
with parameters (such as the exponent, minimum degree, or leading constant) 
that vary from community to community. In that case, the degree of a vertex 
gives us a clue as to its block membership. Our degree-generated (DG) block
models allow heavy-tailed degree distributions, unlike the simple block model,
while taking advantage of vertex degrees to help classify the vertices, unlike
the degree-corrected model of Karrer and Newman. 

To maintain the tractability of the model, we do not generate the degrees 
directly. Instead, we generate the $\theta$ parameters of one of the degree-corrected
block models discussed above, and use them to generate a random multigraph.
Specifically, each $\theta_u$ is generated independently according to some distribution
whose parameters $\psi$ depend on the block $g_u$ to which $u$ belongs. Thus the DG
model is a hierarchical model, which extends the previous degree-corrected block
models by adding a degree generation stage on top, treating the $\theta$s as generated
by the block assignment $g$ and the parameters $\psi$ rather than as parameters.

We can apply this approach to the undirected, directed, or oriented versions 
of the degree-corrected model; at the risk of drowning the reader in acronyms, 
we denote these DG-DC, DG-DDC, and DG-ODC. In each case, the total log- 
likelihood of a graph G is 

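$$\log P(G \mid \psi, \omega, g) = \log \int d\theta \, P(G \mid \theta, \omega, g) \, P(\theta \mid \psi, g) ,$$

where

$$P(\theta \mid \psi, g) = \prod_u P(\theta_u \mid \psi_{g_u}) .$$

For the directed models, we use $\theta_u$ as a shorthand here for $\theta_u^{\mathrm{in}}$ and $\theta_u^{\mathrm{out}}$.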

As in many hierarchical models, computing this integral appears to be difficult,
except when $P(\theta \mid \psi)$ has the form of a conjugate prior such as the Gamma
distribution (see Appendix C). We approximate it by assuming that it is dominated
by the most-likely value $\hat{\theta}$ of $\theta$,

$$\log P(G \mid \psi, \omega, g) \approx \log P(G \mid \hat{\theta}, \omega, g) + \log P(\hat{\theta} \mid \psi, g) .$$

However, even determining $\hat{\theta}$ is challenging where $P(\theta \mid \psi)$ is, say, a power law
with a minimum-degree cutoff. Thus we make a further approximation, setting
$\hat{\theta}$ just by maximizing the block model term $\log P(G \mid \theta, \omega, g)$ as we did before,
using (5) or the analogous equations for the DC or ODC. In essence, these
approximations treat $P(\hat{\theta} \mid \psi, g)$ as a penalty term, imposing a prior probability
on the degree distribution of each community with hyperparameters $\psi$. This
leaves the door open for community structures that might not be as good a fit
to the edges, but compensate with a much better fit to the degrees.

We can either treat the degree-generating parameters $\psi$ as fixed (say, if they
are predicted by a theoretical model of network growth [3, 14]) or infer them
by finding the $\hat{\psi}$ that maximizes $P(\hat{\theta} \mid \psi)$. For instance, suppose the $\theta_u$ in block
$g_u = r$ are distributed as a continuous power law with a lower cutoff $\theta_{\min,r}$.
Specifically, let the parameters in each block $r$ be $\psi_r = (\alpha_r, \beta_r, \theta_{\min,r})$, and

$$P(\theta_u \mid \psi_r) = \begin{cases}
\beta_r & \theta_u = 0 \\
0 & 0 < \theta_u < \theta_{\min,r} \\
\dfrac{(1-\beta_r)(\alpha_r - 1)}{\theta_{\min,r}} \left( \dfrac{\theta_u}{\theta_{\min,r}} \right)^{-\alpha_r} & \theta_u \ge \theta_{\min,r} .
\end{cases}$$

In the directed case, we have $\psi_r^{\mathrm{in}} = (\alpha_r^{\mathrm{in}}, \beta_r^{\mathrm{in}}, \theta_{\min,r}^{\mathrm{in}})$ and $\psi_r^{\mathrm{out}} = (\alpha_r^{\mathrm{out}}, \beta_r^{\mathrm{out}}, \theta_{\min,r}^{\mathrm{out}})$.
Allowing $\beta_r^{\mathrm{out}}$ to be nonzero, for instance, lets us directly include nodes with no
outgoing neighbors; we find this useful in some networks. Alternately, we can
choose $(\theta_u^{\mathrm{in}}, \theta_u^{\mathrm{out}})$ from some joint distribution, allowing in- and out-degrees to
be correlated in various ways.

We fix $\theta_{\min,r} = 1$. Given the degrees and the block assignment, the MLEs for
$\alpha_r$ and $\beta_r$ are as follows. Let $Y_r = \{u : g_u = r \text{ and } \theta_u \ne 0\}$, and let $y_r = |Y_r|$.
Then the most-likely exponent of the power law is [5]

$$\hat{\alpha}_r = 1 + \frac{y_r}{\sum_{u \in Y_r} \ln \theta_u} . \qquad (12)$$

The MLE for $\beta_r$ is simply the fraction of vertices in block $r$ with degree zero.
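
The estimates for one block can be sketched in a few lines. The code below assumes NumPy, a cutoff fixed at 1, and a degree prior read as a point mass $\beta$ at zero plus a normalized continuous power law with weight $1-\beta$ on $\theta \ge 1$; the function names and the synthetic sample are illustrative assumptions. The exponent estimate is the standard continuous power-law MLE, and it is undefined when every nonzero value equals the cutoff.

```python
import numpy as np

def fit_degree_params(theta):
    """MLEs for one block with cutoff 1: beta = fraction of zeros,
    alpha = 1 + y / sum(log theta_u) over the y nonzero entries."""
    theta = np.asarray(theta, dtype=float)
    nonzero = theta[theta > 0]
    beta = 1.0 - len(nonzero) / len(theta)
    alpha = 1.0 + len(nonzero) / np.sum(np.log(nonzero))
    return alpha, beta

def log_prior(theta, alpha, beta):
    """Log-density of the zero-inflated continuous power law with cutoff 1 (assumed form)."""
    theta = np.asarray(theta, dtype=float)
    nonzero = theta[theta > 0]
    n_zero = len(theta) - len(nonzero)
    total = 0.0
    if n_zero:
        total += n_zero * np.log(beta)
    if len(nonzero):
        total += np.sum(np.log1p(-beta) + np.log(alpha - 1.0) - alpha * np.log(nonzero))
    return float(total)

# Synthetic check: 5000 zeros plus 15000 inverse-CDF draws from a power law with alpha = 2.5.
rng = np.random.default_rng(2)
tail = (1.0 - rng.random(15000)) ** (-1.0 / 1.5)
theta = np.concatenate([np.zeros(5000), tail])
alpha_hat, beta_hat = fit_degree_params(theta)
print(alpha_hat, beta_hat)
```

Evaluating `log_prior` at the fitted values for each block gives the penalty term that a degree-generated model adds to the block model log-likelihood.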



3 Experiments on synthetic networks 

In this section, we describe experiments on various synthetic networks. First, 
we generated undirected networks according to the DG-DC model, with two 
blocks or communities of equal size $N/2$. In order to confound the block model
as much as possible, we deliberately designed these networks so that the two
blocks have the same average degree. The degree distribution in block 1 is a
power law with exponent $\alpha = 1.7$, with an upper bound of 1850, so that the
average degree is 20. The degree distribution in block 2 is Poisson, also with
mean 20. As described in Appendix D, the upper bound on the power law is
larger than any degree actually appearing in the network; it really just changes
the normalizing constant of the power law, and the MLE for $\alpha$ can still be
calculated using (12). We assume the algorithm knows that one block has a
power law degree distribution and the other is Poisson, but we force it to infer
the parameters of these distributions.

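As in [13], we use a parameter $\lambda$ to interpolate linearly between a fully
random network with no community structure and a "planted" one where the
communities are completely separated. Thus

$$\omega_{rs} = \lambda \, \omega_{rs}^{\mathrm{planted}} + (1 - \lambda) \, \omega_{rs}^{\mathrm{random}} ,$$

where

$$\omega_{rs}^{\mathrm{random}} = \frac{\kappa_r \kappa_s}{2M} , \qquad \omega^{\mathrm{planted}} = \begin{pmatrix} \kappa_1 & 0 \\ 0 & \kappa_2 \end{pmatrix} .$$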
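
A minimal sketch of this interpolation follows, assuming NumPy and reading the two endpoints as $\omega^{\mathrm{random}}_{rs} = \kappa_r \kappa_s / (2M)$ and $\omega^{\mathrm{planted}} = \mathrm{diag}(\kappa_1, \kappa_2)$; the function name and the example block degrees are illustrative.

```python
import numpy as np

def interpolate_omega(kappa, lam):
    """Return lam * planted + (1 - lam) * random for block degrees kappa."""
    kappa = np.asarray(kappa, dtype=float)
    two_m = kappa.sum()                              # the block degrees sum to 2M
    omega_random = np.outer(kappa, kappa) / two_m    # no community structure
    omega_planted = np.diag(kappa)                   # completely separated communities
    return lam * omega_planted + (1.0 - lam) * omega_random

kappa = np.array([24000.0, 24000.0])   # e.g. 1200 vertices of mean degree 20 in each block
for lam in (0.0, 0.5, 1.0):
    print(lam, interpolate_omega(kappa, lam))
```

With these definitions each row of the result sums to $\kappa_r$ for every value of the mixing parameter, so the interpolation changes only where a block's edges go, not how many it has.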



We inferred the community structure with various models. We ran the Kernighan-Lin
(KL) heuristic first to find a local optimum [13], and then ran the heat-bath
MCMC algorithm with a fixed number of iterations to refine it further where possible.
We initialized each run with a random block assignment; to test the stability
of the models, we also tried initializing them with the correct block assignment.
Since isolated vertices don't participate in the community structure, giving us
little or no basis on which we can classify them, we remove them and focus on
the giant component. For $\lambda = 1$, where the community structure is purely the
"planted" one, we kept two giant components, one in each community.

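For readers unfamiliar with heat-bath updates, the sketch below shows one generic sweep, assuming NumPy. It is not the implementation used in these experiments: the function name, the `temperature` parameter, and the toy objective are hypothetical, and the log-likelihood is recomputed from scratch for each candidate label, which a practical implementation would replace by incremental updates of the block edge counts.

```python
import numpy as np

def heat_bath_sweep(g, K, loglik, rng, temperature=1.0):
    """One sweep: each vertex, in random order, is reassigned to block r with
    probability proportional to exp(loglik(g with g[u] = r) / temperature)."""
    g = np.array(g)
    for u in rng.permutation(len(g)):
        scores = np.empty(K)
        for r in range(K):
            g[u] = r
            scores[r] = loglik(g)
        weights = np.exp((scores - scores.max()) / temperature)  # subtract max for stability
        g[u] = rng.choice(K, p=weights / weights.sum())
    return g

# Toy objective sharply peaked at a known assignment (for illustration only).
rng = np.random.default_rng(3)
target = np.array([0, 0, 1, 1, 0, 1])
toy_loglik = lambda g: -50.0 * np.sum(g != target)
g_new = heat_bath_sweep(np.zeros(6, dtype=int), 2, toy_loglik, rng)
print(g_new)
```

Any of the profile log-likelihoods of Section 2 can be passed as `loglik`, for example as `lambda g: dc_loglik(A, g, K)` for a user-supplied `dc_loglik`.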
We measured accuracy by the normalized mutual information (NMI) [7] be- 
tween the most-likely block assignment found by the model and the correct 
assignment. To make this more concrete, if there are two blocks of equal size
and 95% of the vertices in each block are labeled correctly, the NMI is 0.714. If
90% in each group are labeled correctly, the NMI is 0.531. For groups of unequal 
size, the NMI is a better measure of accuracy than the fraction of vertices labeled 
correctly, since one can make this fraction fairly large simply by assigning every 
vertex to the larger group. 
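
These two reference values can be reproduced with a short computation. The sketch below assumes NumPy and the normalization $2I(A;B) / (H(A) + H(B))$, which is one common definition of the NMI; the function name and the test labelings are illustrative.

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information 2 I(a; b) / (H(a) + H(b)) between two labelings."""
    a, b = np.asarray(a), np.asarray(b)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1)                 # contingency table
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / np.outer(pa, pb)[mask]))
    denom = entropy(pa) + entropy(pb)
    return 2.0 * mi / denom if denom > 0 else 1.0

truth = np.repeat([0, 1], 100)

def mislabel(k):
    """Copy of truth with k vertices of each block given the wrong label."""
    pred = truth.copy()
    pred[:k] = 1
    pred[100:100 + k] = 0
    return pred

nmi_95 = nmi(truth, mislabel(5))     # 95% of each block correct
nmi_90 = nmi(truth, mislabel(10))    # 90% of each block correct
print(round(nmi_95, 3), round(nmi_90, 3))
```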

As shown in Fig. 1, DG-DC works very well even for small $\lambda$. This is because
it can classify most of the vertices simply based on their degrees; if $d_u$ is far
from 20, for instance, then $u$ is probably in block 1. As $\lambda$ increases, it uses
the connections between communities as well, giving near-perfect accuracy for
$\lambda > 0.6$. It does equally well whether its initial assignment is correct or random.

The DC model, in contrast, is unable to use the vertex degrees, and has 
accuracy near zero (i.e., not much better than a random block assignment) for 
$\lambda < 0.2$. Like the SBM 0[9], it may have a phase transition at a critical value of
$\lambda$ below which the community structure is undetectable. Initializing it with the
correct assignment helps somewhat at these values of $\lambda$, but even then it settles
on an assignment far from the correct one. 

The original stochastic block model (SBM), which doesn't correct the de- 
grees, separates vertices with high degrees from vertices with low degrees. Thus 
it cannot find the correct group structure even for large $\lambda$. Our synthetic tests
are designed to have a broad degree distribution in block 1, and thus make SBM
fail. Note that if the degree distribution in block 1 is a power law with a larger
exponent $\alpha$, then most of the degrees will be much lower than 20, in which case
SBM works reasonably well. 




Fig. 1. Tests on synthetic networks generated by the DG-DC model. Each point is 
based on 30 randomly generated networks with N = 2400. For each network and each 
model, we choose the best result from 10 independent runs, initialized either with 
random assignments (the suffix R) or the true block assignment (the suffix T). Each 
run consisted of the KL-heuristic followed by $10^6$ MCMC steps. Our degree-generated
(DG) block model performs much better on these networks than the degree-corrected 
(DC) model. The non-degree-corrected (SBM) model doesn't work at all. 



Next, we generated directed networks according to the DG-DDC model. We 
again have two blocks of equal size, with degree distributions similar to the
undirected networks tested above. In block 1, both out- and in-degrees are power-law
distributed with $\alpha = 1.7$, with an upper bound 1850 so that the expected
degree is 20. In block 2, both out- and in-degrees are Poisson-distributed with
mean 20. To test our oriented and directed models, we interpolate between a
random network and a planted network with completely
asymmetric connections between the blocks,

$$\omega^{\mathrm{planted}} = \begin{pmatrix} (\kappa_1 - \omega_{12})/2 & \omega_{12} \\ 0 & (\kappa_2 - \omega_{12})/2 \end{pmatrix} ,$$

where $\omega_{12} < \min(\kappa_1, \kappa_2)$. We choose $\omega_{12} = \frac{1}{2} \min(\kappa_1, \kappa_2)$.

As Fig. 2 shows, DG-ODC and DG-DDC have very similar performance at
the extremes where $\lambda = 0$ and 1. However, DG-ODC works better than DG-DDC
for other $\lambda$, and both of them achieve much better accuracy than the ODC or
DDC models. As in Fig. 1, the degree-generated models can achieve a high
accuracy based simply on the vertex degrees, and as $\lambda$ grows they leverage this
information further to achieve near-perfect accuracy for $\lambda > 0.8$.

Among the non-degree-generated models, ODC performs significantly better
than DDC for $\lambda > 0.4$. Edges are more likely to point from block 1 to block 2
than vice versa, and ODC can take advantage of this information while DDC 
cannot. As we will see in the next section, ODC performs well on some real-world 
networks for precisely this reason. 




Fig. 2. Tests on synthetic directed networks with $N = 2400$. Left, DG-ODC and
DG-DDC; right, ODC and DDC. The degree-generated models again perform very
well even for small $\lambda$, since they can use in- and out-degrees to classify the vertices.
ODC performs significantly better than DDC for $\lambda > 0.4$, since it can use the edge
orientations to distinguish the two blocks. The number of networks, runs, and MCMC
steps per run are as in Fig. 1.



4 Experiments on real networks 

In this section, we describe experiments on three word adjacency networks in 
which vertices are separated into two blocks: adjectives and nouns. The first
network consists of common words in Dickens' novel David Copperfield [20]. The
other two are formed by adjectives and nouns in the Brown corpus, which is a
tagged corpus of present-day edited American English across various categories,
including news, novels, documents, and many others [11]. We build two different
networks from the Brown corpus. The smaller one contains words in the News 
category (45 archives) that appeared at least 10 times; the larger one contains 
all the adjectives and nouns in the giant component of the entire corpus. 

We considered both the simple version of these networks where $A_{uv} = 1$ if
$u$ and $v$ ever occur together in that order, and the multigraph version where
$A_{uv} \ge 0$ is the number of times they occur together. The sizes, block sizes,
and number of edges of these networks are shown in Table 1. In "News" and
"Brown", the block sizes are quite different, with more nouns than adjectives.
As discussed above, the NMI is a better measure of accuracy than the fraction of 
vertices labeled correctly, since we could make the latter fairly large by labeling 
everything a noun. 

In each network, both blocks have heavy-tailed in- and out-degree distribu- 
tions. The connections between them are disassortative and highly asymmetric: 
since in English adjectives precede nouns more often than they follow them, and 
more often than adjectives precede adjectives or nouns precede nouns, $\omega_{12}$ is
roughly 10 times larger than $\omega_{21}$, and $\omega_{12}$ is larger than either $\omega_{11}$ or $\omega_{22}$. The
$\omega$ for each network corresponding to the correct block assignment (according to
the stochastic block model) is shown in Table 2.



Table 1. Basic statistics of the three word adjacency networks. S and M denote the 
simple and multigraph versions respectively. 



Network | #words | #adjective | #noun | #edges (S) | #edges (M)
David   | 112    | 57         | 55    | 569        | 1494
News    | 376    | 91         | 285   | 1389       | 2411
Brown   | 23258  | 6235       | 17023 | 66734      | 88930



Table 2. The matrices $\omega_{rs} = m_{rs}/(n_r n_s)$ for the most-likely block assignment according
to the stochastic block model.

Network  | $\omega_{11}$ | $\omega_{12}$ | $\omega_{21}$ | $\omega_{22}$
David(S) | 0.039   | 0.118   | 0.018   | 0.006
David(M) | 0.080   | 0.358   | 0.025   | 0.011
News(S)  | 0.010   | 0.015   | 0.002   | 0.010
News(M)  | 0.012   | 0.028   | 0.003   | 0.019
Brown(S) | 9.1e-05 | 3.4e-04 | 2.0e-05 | 8.8e-05
Brown(M) | 1.1e-04 | 4.4e-04 | 2.4e-05 | 1.2e-04

4.1 Performance of oriented and degree-corrected models

Table 3 compares the performance of non-degree-generated block models, including
SBM, DC, ODC, and DDC. When applying DC, we ignore the edge
orientations, and treat the graph or multigraph as undirected (note that the resulting
network may contain multi-edges even though the directed one doesn't).

In our experiments, we started with a random initial block assignment, ran
the Kernighan-Lin (KL) heuristic to find a local optimum [13], and then ran
the heat-bath MCMC algorithm. We also tested a naive heuristic (NH) which
simply labels a vertex $v$ as an adjective if $d_v^{\mathrm{out}} > d_v^{\mathrm{in}}$, and a noun if $d_v^{\mathrm{in}} > d_v^{\mathrm{out}}$.
If $d_v^{\mathrm{out}} = d_v^{\mathrm{in}}$, NH labels $v$ randomly with equal probabilities.
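
The naive heuristic is simple enough to state in code. The sketch below assumes NumPy and the convention that `A[u, v]` counts how often word `u` immediately precedes word `v`; the function name, the label encoding (0 for adjective, 1 for noun), and the four-word example are illustrative assumptions.

```python
import numpy as np

def naive_heuristic(A, rng):
    """Label 0 (adjective) if out-degree > in-degree, 1 (noun) if in-degree > out-degree;
    ties are broken uniformly at random."""
    A = np.asarray(A)
    d_out, d_in = A.sum(axis=1), A.sum(axis=0)
    labels = np.where(d_out > d_in, 0, 1)
    ties = d_out == d_in
    labels[ties] = rng.integers(0, 2, size=int(ties.sum()))
    return labels

# Toy adjacency: "big dog", "big cat", "old dog" with words indexed big=0, old=1, dog=2, cat=3.
A = np.zeros((4, 4), dtype=int)
A[0, 2] = A[0, 3] = A[1, 2] = 1
labels = naive_heuristic(A, np.random.default_rng(4))
print(labels)   # big and old are labeled adjectives, dog and cat nouns
```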

Table 3. For each model and each network, we pick the block assignment with highest
likelihood and compute its NMI with the correct block assignment. Each run consisted
of the KL-heuristic, starting with a random block assignment, followed by $10^6$ MCMC
steps. The results for "David" and "News" are based on 100 independent runs; for
"Brown", 50 runs are executed. The best NMI for each network is shown in bold.

Model | David(S) | David(M) | News(S) | News(M) | Brown(S) | Brown(M)
SBM   | .423     | .051     | .006    | .018    | .001     | 7e-04
DC    | .566     | .568     | .084    | .083    | .020     | .015
ODC   | .462     | .470     | .084    | .029    | .311     | .318
DDC   | .128     | 8e-04    | .084    | .091    | .016     | .012
NH    | .395     | .449     | .215    | .233    | .309     | .314


For "David", DC and ODC work fairly well, and both are better than the 
naive NH. Moreover, the mistakes they make are instructive. There are three 
adjectives with out-degree zero: "full", "glad", and "alone". ODC mislabels these
since it expects edges to point away from adjectives, while DC labels them 
correctly by using the fact that (undirected) edges are disassortative, crossing 
from one block to the other. 

The standard SBM works well on "David(S)" but fails on "David(M)" be- 
cause the degrees in the multigraph are more skewed than those in the simple 
one. Finally, DDC performs the worst; by correcting for in- and out-degrees sep- 
arately, it loses any information that the edge orientations could provide, and 
even fails to notice the disassortative structure that DC uses. Thus full degree- 
correction in the directed case can make things worse, even when the degrees in 
each community are broadly distributed. 

For "Brown", all these models fail except ODC, although it does only slightly
better than the naive NH. For "News", all these models fail, even ODC. Despite
the degree correction, the most-likely block assignment is highly assortative,
with high-degree vertices connecting to each other. However, we found that in
most runs on "News", ODC used the edge orientations successfully to find
a block assignment close to the correct one; it found the assortative structure
only occasionally. This suggests that, even though the "wrong" structure has a
higher likelihood, we can do much better if we know what kind of community 
structure to look for; in this case, disassortative and directed. 

To test this hypothesis, we tried giving the models a hint about the commu- 
nity structure by using NH to determine the initial block assignment. We then 
performed the KL heuristic and the MCMC algorithm as before. As Table 4
shows, this hint improves ODC's performance on "News" significantly; it is able
to take the initial naive classification, based solely on degrees, and refine it using
the network's structure. Note that this more accurate assignment actually has
lower likelihood than the one found in Table 3 using a random initial condition,
so NH helps the model stay in a more accurate, but less likely, local optimum.
Starting with NH improves DC's performance on "Brown" somewhat, but DC
still ends up with an assignment less accurate than the naive one. 



Table 4. Results using the naive NH assignment as the initial condition, again followed
by $10^6$ MCMC steps. This hint now lets ODC outperform the other models on "News".

Model | David(S) | David(M) | News(S) | News(M) | Brown(S) | Brown(M)
SBM   | .423     | .051     | .006    | .021    | .001     | 7e-04
DC    | .566     | .568     | .084    | .015    | .160     | .155
ODC   | .462     | .470     | .247    | .270    | .311     | .318
DDC   | .015     | .060     | .084    | .005    | .005     | .070
NH    | .395     | .449     | .215    | .233    | .309     | .314


4.2 Performance of degree-generated models 

In this section, we measure the performance of degree-generated models on the 
Brown network, and compare them to their non-degree-generated counterparts. 
As Fig. 3 shows, the in- and out-degree distributions in each block have heavy
tails close to a power-law. Moreover, the out-degrees of the adjectives have a 
heavier tail than those of the nouns, and vice versa for the in-degrees. This is 
exactly the kind of difference in the degree distributions between communities 
that our DG block models are designed to take advantage of. 

Setting $\theta_{\min} = 1$, we can estimate the parameters $\alpha$ and $\beta$ for these
distributions as discussed in Section 2.3. We show the most likely values of these
parameters, given the correct assignment, in Table 5.

As Table 6 shows, degree generation improves DC and DDC significantly,
letting them find a good assignment as opposed to one with NMI near zero. For
ODC, the performance improvement is slight, making DG-ODC the best model
overall, but there is another effect. We compare performance starting with the
KL heuristic to performance using MCMC alone. We see that degree generation
gives ODC almost as much benefit as the KL heuristic does. In other words, it
speeds up the MCMC optimization process, letting ODC find a good assignment
without the initial help of the (computationally expensive) KL heuristic.

Table 5. MLEs for the degree generation parameters in the Brown network, given the
correct assignment.

            Brown(S)                                                  Brown(M)
block       $\alpha_{in}$  $\alpha_{out}$  $\beta_{in}$  $\beta_{out}$    $\alpha_{in}$  $\alpha_{out}$  $\beta_{in}$  $\beta_{out}$
adjective   2.329          2.629           0.161         0.527            2.136          2.326           0.161         0.527
noun        2.721          2.248           0.716         0.021            2.576          2.134           0.716         0.021

Table 6. Performance of degree-generated vs. non-degree-generated models. KL
indicates that we applied the KL heuristic and then $10^6$ MCMC steps, as opposed to
MCMC alone. DG indicates degree generation. Each number gives the NMI for the
most-likely assignment found in 50 independent runs. The best model is DG-ODC.
Moreover, degree generation helps ODC converge, providing much of the benefit of the
KL heuristic while avoiding its long running time (compare the ODC entries in the
KL-only and DG-only rows).

            Brown(S)              Brown(M)
KL   DG     DC     ODC    DDC     DC     ODC    DDC
            .010   .188   .008    .007   .203   .011
KL          .020   .311   .016    .015   .318   .012
     DG     .267   .302   .213    .278   .310   .149
KL   DG     .271   .312   .225    .284   .320   .195
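
The tables in this section score each inferred assignment by its normalized mutual information (NMI) with the reference assignment. As a rough illustration, the sketch below computes NMI for two label lists. The normalization $2I/(H_a + H_b)$ is an assumption made here, since this section does not state which normalization is used, and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter

def nmi(a, b):
    """NMI of two labelings, normalized as 2 I(a;b) / (H(a) + H(b)) (assumed convention)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # Entropies of the two labelings
    h_a = -sum(c / n * math.log(c / n) for c in pa.values())
    h_b = -sum(c / n * math.log(c / n) for c in pb.values())
    # Mutual information from the joint label counts
    mi = sum(c / n * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())
    # Two trivial (single-block) labelings are treated as identical
    return 2.0 * mi / (h_a + h_b) if h_a + h_b > 0 else 1.0

same_up_to_relabeling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI is invariant under relabeling of the blocks, which is why it can compare an inferred assignment with a reference one without matching block names first.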



5 Conclusions 

Degree correction in stochastic block models provides a powerful approach to 
dealing with networks with inhomogeneous degree distributions. However, in a 
sense it denies information to the inference process, since a generative model can 
only help us learn from the data that it has to generate. 

We have introduced two new kinds of block models that allow for broad or
heavy-tailed degree distributions, while using the degrees to help us detect
communities. The oriented degree-corrected model (ODC) performs partial degree
correction, taking the total degrees as parameters but generating edge
orientations. The degree-generated (DG) models do not take the degrees as parameters,
but assume that they are generated according to some prior in each community.

Unlike the directed degree-corrected (DDC) block model, which takes both 
in- and out-degrees as parameters, ODC is able to capture and account for 
certain correlations between the in- and out-degrees. Simply put, for ODC, two 
vertices are unlikely to be in the same community if one has high in-degree 
and low out-degree while the other has high out-degree and low in-degree. If the
network is highly directed or asymmetric, the edge orientations can help ODC 
find community structures that DDC fails to perceive. 

Our DG models use degree-corrected block models as a subroutine, but im- 
pose a penalty term based on the prior likelihood of the degree distribution in 
each community. They can take the (hyper)parameters of these priors as given, 
or infer them "on the fly." DG models achieve high accuracy even when the density
of connections between communities is close to uniform, as we illustrated
in synthetic networks for small $\lambda$. Augmenting block models, such as the ODC,
with degree generation also appears to speed up their convergence in some cases,
helping simple algorithms like MCMC handle large networks without the benefit 
of expensive preprocessing steps like the KL heuristic. 

On the other hand, the effectiveness of DG depends heavily on knowing the 
correct form of the degree distribution in each community. Without some prior 
ground truth about the block assignment, or domain-specific knowledge, finding 
an appropriate family of degree distributions may be difficult for some networks. 

With all these variants of the block model, ranging from the "classic" version 
to degree-corrected and degree-generated variants, we now have a wide variety 
of tools for inferring structure in network data. Each model will perform better 
on some networks and worse on others. A better understanding of the strengths 
and weaknesses of each one — which kinds of structure they can see, and what 
kinds of structure they are blind to — will help us select the right algorithm each 
time we meet a new network. 
A Maximum Likelihood Estimators for the directed
degree-corrected (DDC) block model 

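We maximize the log-likelihood function

\[
\log P(G \mid \theta, \omega, g) = \sum_u \left( d_u^{\mathrm{out}} \log \theta_u^{\mathrm{out}} + d_u^{\mathrm{in}} \log \theta_u^{\mathrm{in}} \right)
 + \sum_{rs} \left( m_{rs} \log \omega_{rs} - \kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}} \omega_{rs} \right) , \tag{14}
\]

where we have imposed the constraints on the parameters

\[
\sum_{u: g_u = r} \theta_u^{\mathrm{out}} = \kappa_r^{\mathrm{out}}
\quad \text{and} \quad
\sum_{u: g_u = r} \theta_u^{\mathrm{in}} = \kappa_r^{\mathrm{in}} . \tag{15}
\]

For each block $r$, we associate Lagrange multipliers $\lambda_r^{\mathrm{out}}, \lambda_r^{\mathrm{in}}$ with these
constraints. For each vertex $u$, taking the partial derivative of the log-likelihood
with respect to $\theta_u^{\mathrm{out}}$ and $\theta_u^{\mathrm{in}}$ gives

\[
\frac{d_u^{\mathrm{out}}}{\theta_u^{\mathrm{out}}} = \lambda_{g_u}^{\mathrm{out}}
\quad \text{and} \quad
\frac{d_u^{\mathrm{in}}}{\theta_u^{\mathrm{in}}} = \lambda_{g_u}^{\mathrm{in}} . \tag{16}
\]

To satisfy the constraints (15), we take $\lambda_r^{\mathrm{out}} = \lambda_r^{\mathrm{in}} = 1$ for all $r$, so that

\[
\hat\theta_u^{\mathrm{out}} = d_u^{\mathrm{out}}
\quad \text{and} \quad
\hat\theta_u^{\mathrm{in}} = d_u^{\mathrm{in}} .
\]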
Setting the partial derivative of the log-likelihood function with respect to $\omega_{rs}$
to zero then gives

\[
\hat\omega_{rs} = \frac{m_{rs}}{\kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}}} .
\]
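
As a rough illustration of these closed-form estimates, the sketch below computes them from a directed edge list. It assumes that $d_u^{\mathrm{out}}$ and $d_u^{\mathrm{in}}$ are the out- and in-degrees, that $m_{rs}$ counts edges from block $r$ to block $s$, that $\kappa_r^{\mathrm{out}}$ and $\kappa_r^{\mathrm{in}}$ are the block totals of those degrees, and that the stationarity condition of (14) in $\omega_{rs}$ is $m_{rs}/\omega_{rs} = \kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}}$. The function name `ddc_mles` and the input format are hypothetical.

```python
from collections import defaultdict

def ddc_mles(edges, g):
    """Closed-form DDC estimates (a sketch under the assumptions stated above).

    edges: list of (u, v) pairs, one per directed edge u -> v (multi-edges allowed)
    g:     dict mapping every vertex to its block label
    """
    d_out = defaultdict(int)
    d_in = defaultdict(int)
    m = defaultdict(int)              # m[(r, s)]: number of edges from block r to block s
    for u, v in edges:
        d_out[u] += 1
        d_in[v] += 1
        m[(g[u], g[v])] += 1

    kappa_out = defaultdict(int)      # total out-degree of each block
    kappa_in = defaultdict(int)       # total in-degree of each block
    for u, r in g.items():
        kappa_out[r] += d_out[u]
        kappa_in[r] += d_in[u]

    theta_out = {u: d_out[u] for u in g}   # theta_out = out-degree
    theta_in = {u: d_in[u] for u in g}     # theta_in = in-degree
    # omega_rs = m_rs / (kappa_r^out * kappa_s^in); only block pairs with edges are stored
    omega = {(r, s): m_rs / (kappa_out[r] * kappa_in[s]) for (r, s), m_rs in m.items()}
    return theta_out, theta_in, omega

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
g = {"a": 0, "b": 0, "c": 1}
theta_out, theta_in, omega = ddc_mles(edges, g)
```

With these estimates the expected number of edges from block $r$ to block $s$, $\omega_{rs} \kappa_r^{\mathrm{out}} \kappa_s^{\mathrm{in}}$, equals the observed count $m_{rs}$.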
B Another view of the ODC model

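Here we show that the oriented degree-corrected (ODC) model is a special case
of the directed degree-corrected (DDC) model. Recall that the ODC model first
generates an undirected graph according to the DC model with parameters $\theta_u$
and $\omega_{rs}$, and then orients each edge $(u, v)$ from $u$ to $v$ with probability $p_{g_u, g_v}$.
The number of directed edges from $u$ to $v$ is then Poisson-distributed as

\[
A_{uv} \sim \mathrm{Poi}(\theta_u \theta_v \omega_{g_u, g_v} p_{g_u, g_v}) .
\]

But if we write

\[
\omega'_{rs} = \omega_{rs} p_{rs} ,
\]

then

\[
A_{uv} \sim \mathrm{Poi}(\theta_u \theta_v \omega'_{g_u, g_v}) .
\]

Thus ODC is the special case of DDC where $\theta_u^{\mathrm{in}} = \theta_u^{\mathrm{out}} = \theta_u$ for all vertices $u$.

For completeness, we check that the two models correspond when we set
these parameters equal to their MLEs. We impose the constraint
$\sum_{u: g_u = r} \theta_u = \kappa_r$ for all blocks $r$. Ignoring constants, the log-likelihood is then

\[
\log P(G \mid \theta, \omega', g) = \sum_u d_u \log \theta_u + \sum_{rs} \left( m_{rs} \log \omega'_{rs} - \kappa_r \kappa_s \omega'_{rs} \right) , \tag{17}
\]

where $d_u = d_u^{\mathrm{out}} + d_u^{\mathrm{in}}$. The MLEs for $\theta_u$ and $\omega'_{rs}$ are then

\[
\hat\theta_u = d_u , \qquad \hat\omega'_{rs} = \frac{m_{rs}}{\kappa_r \kappa_s} . \tag{18}
\]

Thus $\hat\omega'_{rs} = \hat\omega_{rs} \hat p_{rs}$ where, writing $m_{rs} + m_{sr}$ for the number of
undirected edges between blocks $r$ and $s$,

\[
\hat\omega_{rs} = \frac{m_{rs} + m_{sr}}{\kappa_r \kappa_s}
\quad \text{and} \quad
\hat p_{rs} = \frac{m_{rs}}{m_{rs} + m_{sr}} ,
\]

recovering (11).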
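
The factorization above can be checked numerically. The following sketch assumes that the undirected edge count between blocks $r$ and $s$ is $m_{rs} + m_{sr}$, with $m_{rs}$ the directed count from $r$ to $s$, and that $\kappa_r$ is the total (in plus out) degree of block $r$. The function name and the toy counts are hypothetical.

```python
def odc_estimates(m, kappa):
    """Return (omega, p, omega_prime) for every ordered pair of blocks.

    m:     dict with m[(r, s)] = number of directed edges from block r to block s
    kappa: dict with kappa[r]  = total degree (in + out) of block r, assumed positive
    """
    omega, p, omega_prime = {}, {}, {}
    for r in kappa:
        for s in kappa:
            m_rs = m.get((r, s), 0)
            m_bar = m_rs + m.get((s, r), 0)          # undirected count (assumption)
            omega[(r, s)] = m_bar / (kappa[r] * kappa[s])
            # Orientation probability; arbitrary (1/2) when there are no edges at all
            p[(r, s)] = m_rs / m_bar if m_bar else 0.5
            omega_prime[(r, s)] = m_rs / (kappa[r] * kappa[s])
    return omega, p, omega_prime

m = {(0, 0): 1, (0, 1): 2, (1, 0): 1}
kappa = {0: 5, 1: 3}
omega, p, omega_prime = odc_estimates(m, kappa)
```

In this toy example `p[(0, 1)] + p[(1, 0)]` is 1 and `p[(0, 0)]` is 1/2, as expected for orientation probabilities, and `omega[(r, s)] * p[(r, s)]` agrees with `omega_prime[(r, s)]` for every pair.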



C Bayesian estimation for DG models 

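Bayesian inference focuses on posterior distributions of parameters rather than
on point estimates. In hierarchical models like DG-DDC, the full Bayesian
posterior of the parameters $\theta$ (omitting the other parameters $g$ and $\omega$) is

\[
P(\theta \mid G) = \int P(\theta \mid G, \psi) P(\psi \mid G) \, d\psi .
\]

Here we employ the Empirical Bayesian method, and use point estimates for the
hyperparameters $\psi$, namely their MLEs $\hat\psi$,

\[
\hat\psi = \operatorname*{argmax}_{\psi} P(G \mid \psi)
 = \operatorname*{argmax}_{\psi} \int P(G \mid \theta, \psi) P(\theta \mid \psi) \, d\theta . \tag{19}
\]

With this approximation we have

\[
\begin{aligned}
P(\theta \mid G) &\approx P(\theta \mid G, \hat\psi) \\
 &= \frac{P(G \mid \theta) P(\theta \mid \hat\psi)}{P(G \mid \hat\psi)} \\
 &= \frac{P(G \mid \theta) P(\theta \mid \hat\psi)}{\int P(G \mid \theta, \hat\psi) P(\theta \mid \hat\psi) \, d\theta} ,
\end{aligned} \tag{20}
\]

where we used Bayes' rule in the second line.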

Computing the posterior $P(\theta \mid G)$ is usually difficult, as the integral in the
denominator of (20) is often intractable. However, with a clever choice of the
prior distribution $P(\theta \mid \psi)$, we can work out an analytic solution. Such a prior is
called a conjugate prior for the likelihood term. We focus here on DG-DDC; the
calculations for other degree-generated models are similar.

Say that a random variable $X$ is Gamma-distributed with parameters $\alpha, \beta$,
and write $X \sim \Gamma(\alpha, \beta)$, if its probability distribution is

\[
f(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \, x^{\alpha - 1} e^{-\beta x} .
\]

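In DG-DDC, the likelihood can be written (where we have plugged in the
MLEs for $\omega$, and substituted $\kappa_r^{\mathrm{out}} = \sum_{u: g_u = r} \theta_u^{\mathrm{out}}$), up to factors that
do not depend on $\theta$, as

\[
P(G \mid \theta^{\mathrm{out}}, \theta^{\mathrm{in}}) \propto \prod_u (\theta_u^{\mathrm{out}})^{d_u^{\mathrm{out}}} \exp(-\theta_u^{\mathrm{out}}) \, (\theta_u^{\mathrm{in}})^{d_u^{\mathrm{in}}} \exp(-\theta_u^{\mathrm{in}}) . \tag{21}
\]

If we assume that the $\theta_u^{\mathrm{in}}$ and $\theta_u^{\mathrm{out}}$ for each $u$ are independent, this is
proportional to a product of Gamma distributions with parameters $\alpha = d_u^{\mathrm{out}} + 1$ and
$\beta = 1$ for each $\theta_u^{\mathrm{out}}$.

A natural conjugate prior for Gamma distributions is the Gamma distribution
itself. Let the hyperparameters $\psi_r^{\mathrm{out}}$ for each block $r$ consist of a pair
$(\alpha_r^{\mathrm{out}}, \beta_r^{\mathrm{out}})$, and consider the prior

\[
\theta_u^{\mathrm{out}} \sim \Gamma(\alpha_{g_u}^{\mathrm{out}}, \beta_{g_u}^{\mathrm{out}}) .
\]

That is,

\[
P(\theta_u^{\mathrm{out}} \mid \psi_{g_u}^{\mathrm{out}}) = f(\theta_u^{\mathrm{out}}; \alpha_{g_u}^{\mathrm{out}}, \beta_{g_u}^{\mathrm{out}}) .
\]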

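Multiplying this prior by the likelihood (21) stays within the family of Gamma
distributions, and simply updates the parameters:

\[
\begin{aligned}
P(\theta_u^{\mathrm{out}} \mid G) &\propto P(\theta_u^{\mathrm{out}} \mid \psi_{g_u}^{\mathrm{out}}) \, P(G \mid \theta_u^{\mathrm{out}}) \\
 &\propto (\theta_u^{\mathrm{out}})^{\alpha_{g_u}^{\mathrm{out}} + d_u^{\mathrm{out}} - 1} \exp\!\left( -\theta_u^{\mathrm{out}} (\beta_{g_u}^{\mathrm{out}} + 1) \right) .
\end{aligned}
\]

Thus the posterior distribution is

\[
\theta_u^{\mathrm{out}} \mid G \sim \Gamma(\alpha_{g_u}^{\mathrm{out}} + d_u^{\mathrm{out}}, \; \beta_{g_u}^{\mathrm{out}} + 1) .
\]

Note that if we use an uninformative prior, i.e., in the limit $\alpha_r^{\mathrm{out}} \to 1$ and
$\beta_r^{\mathrm{out}} \to 0$, the Gamma prior reduces to a uniform prior. The maximum a posteriori (MAP)
estimate of $\theta_u^{\mathrm{out}}$, which is the mode of the posterior, is then

\[
\hat\theta_u^{\mathrm{out}} = \frac{\alpha_{g_u}^{\mathrm{out}} + d_u^{\mathrm{out}} - 1}{\beta_{g_u}^{\mathrm{out}} + 1} = d_u^{\mathrm{out}} , \tag{22}
\]

and similarly for $\theta_u^{\mathrm{in}}$, just as we obtained for the MLEs in (5).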
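
A minimal sketch of this conjugate update, assuming the shape and rate parametrization of the Gamma distribution and that observing a degree $d$ contributes a likelihood factor proportional to $\theta^{d} e^{-\theta}$. The function names are hypothetical.

```python
def gamma_posterior(alpha, beta, d):
    """Posterior (shape, rate) after observing degree d under a Gamma(alpha, beta) prior."""
    return alpha + d, beta + 1.0

def gamma_mode(shape, rate):
    """Mode of a Gamma(shape, rate) density, valid for shape >= 1."""
    return (shape - 1.0) / rate

# Uniform-prior limit alpha -> 1, beta -> 0: the MAP estimate equals the degree
shape, rate = gamma_posterior(alpha=1.0, beta=0.0, d=7)
map_uniform = gamma_mode(shape, rate)

# An informative prior pulls the estimate toward the prior
shape, rate = gamma_posterior(alpha=2.5, beta=0.5, d=7)
map_informative = gamma_mode(shape, rate)
```

In the uniform-prior limit the MAP estimate coincides with the observed degree, while the informative prior in the second call shrinks it below 7.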

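However, our goal is to integrate over $\theta$, not focus on its MAP estimate. So let
us continue the Bayesian analysis. Assuming the $\theta$ parameters are independent,
their joint posterior is simply a product of their individual posteriors:

\[
\begin{aligned}
P(\theta \mid G) &= \prod_u P(\theta_u^{\mathrm{out}} \mid G) \, P(\theta_u^{\mathrm{in}} \mid G) \\
 &= \prod_u f(\theta_u^{\mathrm{out}}; \alpha_{g_u}^{\mathrm{out}} + d_u^{\mathrm{out}}, \beta_{g_u}^{\mathrm{out}} + 1) \, f(\theta_u^{\mathrm{in}}; \alpha_{g_u}^{\mathrm{in}} + d_u^{\mathrm{in}}, \beta_{g_u}^{\mathrm{in}} + 1) .
\end{aligned} \tag{23}
\]

Then we can calculate the integral in (19) and (20) by the simple algebra

\[
\begin{aligned}
\int P(G \mid \theta, \psi) P(\theta \mid \psi) \, d\theta
 &= \frac{P(G \mid \theta) P(\theta \mid \psi)}{P(\theta \mid G)} \\
 &\propto \prod_u \frac{f(\theta_u^{\mathrm{out}}; d_u^{\mathrm{out}} + 1, 1) \, f(\theta_u^{\mathrm{in}}; d_u^{\mathrm{in}} + 1, 1) \, f(\theta_u^{\mathrm{out}}; \alpha_{g_u}^{\mathrm{out}}, \beta_{g_u}^{\mathrm{out}}) \, f(\theta_u^{\mathrm{in}}; \alpha_{g_u}^{\mathrm{in}}, \beta_{g_u}^{\mathrm{in}})}{f(\theta_u^{\mathrm{out}}; \alpha_{g_u}^{\mathrm{out}} + d_u^{\mathrm{out}}, \beta_{g_u}^{\mathrm{out}} + 1) \, f(\theta_u^{\mathrm{in}}; \alpha_{g_u}^{\mathrm{in}} + d_u^{\mathrm{in}}, \beta_{g_u}^{\mathrm{in}} + 1)} \\
 &= \prod_u \frac{(\beta_{g_u}^{\mathrm{out}})^{\alpha_{g_u}^{\mathrm{out}}} \, \Gamma(\alpha_{g_u}^{\mathrm{out}} + d_u^{\mathrm{out}})}{(\beta_{g_u}^{\mathrm{out}} + 1)^{\alpha_{g_u}^{\mathrm{out}} + d_u^{\mathrm{out}}} \, \Gamma(d_u^{\mathrm{out}} + 1) \, \Gamma(\alpha_{g_u}^{\mathrm{out}})}
 \cdot \frac{(\beta_{g_u}^{\mathrm{in}})^{\alpha_{g_u}^{\mathrm{in}}} \, \Gamma(\alpha_{g_u}^{\mathrm{in}} + d_u^{\mathrm{in}})}{(\beta_{g_u}^{\mathrm{in}} + 1)^{\alpha_{g_u}^{\mathrm{in}} + d_u^{\mathrm{in}}} \, \Gamma(d_u^{\mathrm{in}} + 1) \, \Gamma(\alpha_{g_u}^{\mathrm{in}})} .
\end{aligned} \tag{24}
\]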



Now that the dependence of the numerator and denominator on $\theta$ has cancelled
out, the integral is a function only of the hyperparameters $\psi$, making it possible
to do the point estimate of $\psi$ in (19). In our case, optimizing for $\psi$ requires some
numeric techniques, but it is nonetheless doable.
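
One possible numeric approach, sketched here under stated assumptions rather than as the procedure actually used, treats each block and each direction separately. It takes the per-vertex factor $\beta^{\alpha} \Gamma(\alpha + d) / \left( (\beta + 1)^{\alpha + d} \Gamma(d + 1) \Gamma(\alpha) \right)$, which follows from the Gamma prior and a likelihood factor proportional to $\theta^{d} e^{-\theta}$, and maximizes its log-sum over the block's degrees. For fixed $\alpha$, setting the derivative with respect to $\beta$ to zero gives $\beta = \alpha / \bar d$, where $\bar d$ is the mean degree, so a one-dimensional grid search over $\alpha$ suffices. The function names, the grid, and the toy degrees are hypothetical.

```python
import math

def log_marginal(d, alpha, beta):
    """log of beta^alpha Gamma(alpha+d) / ((beta+1)^(alpha+d) Gamma(d+1) Gamma(alpha))."""
    return (alpha * math.log(beta) + math.lgamma(alpha + d)
            - (alpha + d) * math.log(beta + 1.0)
            - math.lgamma(d + 1.0) - math.lgamma(alpha))

def fit_block(degrees, alphas):
    """Grid search over alpha, with beta = alpha / mean(degrees) at each grid point.

    Assumes the mean degree is positive. Returns (log-likelihood, alpha, beta).
    """
    mean = sum(degrees) / len(degrees)
    best = None
    for alpha in alphas:
        beta = alpha / mean                      # stationary point in beta for this alpha
        ll = sum(log_marginal(d, alpha, beta) for d in degrees)
        if best is None or ll > best[0]:
            best = (ll, alpha, beta)
    return best

degrees = [0, 1, 1, 2, 3, 5, 8, 20]              # toy degrees of one block
grid = [0.05 * k for k in range(1, 201)]         # candidate values of alpha
ll, alpha_hat, beta_hat = fit_block(degrees, grid)
```

A finer grid, or a standard one-dimensional optimizer applied to the profiled objective, could replace the grid search; the grid is used here only to keep the sketch dependency-free.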

The Empirical Bayesian solution not only gives a better approximation to the
original problem, it also makes it possible to integrate prior knowledge if available.
On top of that, because the posterior is now a direct function of the
hyperparameters $\psi$, we no longer have to worry about the Poisson noise when estimating
$\psi$ indirectly from degrees.

On a final note, the above result only holds for Gamma priors. With any 
other prior, the integral may not be this simple. 



D Power-law distribution with upper bound 



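In this section, we show that imposing an upper bound on our power-law
distributions in order to ensure a certain average degree does not appreciably change
the procedure of [6] for estimating the exponent. Suppose $x$ is distributed as a
power law with lower bound $x_{\min}$, upper bound $x_{\max}$, and exponent $\alpha > 0$. Then

\[
p(x) = \frac{\alpha - 1}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}} \, x^{-\alpha} , \qquad x_{\min} \le x \le x_{\max} .
\]

Given a random sample $\mathbf{x} = \{x_1, \ldots, x_n\}$ drawn from this distribution
independently, the likelihood function is

\[
p(\mathbf{x}) = \prod_{i=1}^{n} \frac{\alpha - 1}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}} \, x_i^{-\alpha}
 = \left( \frac{\alpha - 1}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}} \right)^{\! n} \prod_{i=1}^{n} x_i^{-\alpha} .
\]

Thus, the log-likelihood is

\[
\log p(\mathbf{x}) = n \left( \log(\alpha - 1) - \log\!\left( x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha} \right) \right) - \alpha \sum_{i=1}^{n} \log x_i .
\]

Taking the derivative with respect to $\alpha$ gives

\[
\frac{\partial \log p(\mathbf{x})}{\partial \alpha}
 = n \left( \frac{1}{\alpha - 1} + \frac{x_{\min}^{1-\alpha} \log x_{\min} - x_{\max}^{1-\alpha} \log x_{\max}}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}} \right) - \sum_{i=1}^{n} \log x_i . \tag{25}
\]

Setting (25) to zero, we get

\[
\frac{1}{\alpha - 1} + \frac{x_{\min}^{1-\alpha} \log x_{\min} - x_{\max}^{1-\alpha} \log x_{\max}}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}}
 = \frac{\sum_{i=1}^{n} \log x_i}{n} . \tag{26}
\]

If $x_{\min} = 1$ and $x_{\max} \to \infty$, then solving (26) gives the MLE for $\alpha$ just as in (12).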
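
For finite $x_{\max}$, (26) has no closed-form solution, but it can be solved numerically. The sketch below uses bisection and assumes that the root lies in an interval $(lo, hi)$ with $lo > 1$; it also compares the result with the closed form $\alpha = 1 + n / \sum_i \log x_i$, which is what (26) reduces to when $x_{\min} = 1$, $x_{\max} \to \infty$ and $\alpha > 1$. The function names, the bracket, and the synthetic sample are hypothetical.

```python
import math

def score(alpha, mean_log_x, x_min, x_max):
    """Left side of (26) minus its right side; zero at the MLE."""
    a = x_min ** (1.0 - alpha)
    b = x_max ** (1.0 - alpha)
    bracket = (a * math.log(x_min) - b * math.log(x_max)) / (a - b)
    return 1.0 / (alpha - 1.0) + bracket - mean_log_x

def fit_alpha_bounded(xs, x_min, x_max, lo=1.001, hi=20.0, tol=1e-12):
    """Bisection for the root of (26), assuming it lies in (lo, hi) with lo > 1."""
    mean_log_x = sum(math.log(x) for x in xs) / len(xs)
    if score(lo, mean_log_x, x_min, x_max) <= 0 or score(hi, mean_log_x, x_min, x_max) >= 0:
        raise ValueError("no sign change of the score on (lo, hi)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, mean_log_x, x_min, x_max) > 0:
            lo = mid          # score still positive: the root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_alpha_unbounded(xs):
    """Closed form obtained from (26) with x_min = 1 and x_max -> infinity."""
    return 1.0 + len(xs) / sum(math.log(x) for x in xs)

# Deterministic sample: evenly spaced quantiles of a power law with exponent 2.5, x_min = 1
n = 1000
xs = [(1.0 - (i + 0.5) / n) ** (-1.0 / 1.5) for i in range(n)]
alpha_bounded = fit_alpha_bounded(xs, x_min=1.0, x_max=1e12)
alpha_closed = fit_alpha_unbounded(xs)
```

With a very large upper bound the two estimates agree closely, which illustrates the claim that the bound does not appreciably change the estimate in that regime.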



